Abstract

We propose a method for segmentation of ex- 
pository texts based on hierarchical agglomera- 
tive clustering. The method uses paragraphs as 
the basic segments for identifying hierarchical 
discourse structure in the text, applying lexical 
similarity between them as the proximity test. 
Linear segmentation can be induced from the 
identified structure through application of two 
simple rules. However, the hierarchy can also be used
for intelligent exploration of the text. The
proposed segmentation algorithm is evaluated 
against an accepted linear segmentation method 
and shows comparable results. 



Introduction 

The interest in expository texts comes, among 
others, from their widespread use as online
information resources. Information retrieval gives us
methods for scoring document collections based 
on their relevance to a query. However once we 
are about to browse a document, or extract some 
specific information from it, an interest arises for 
deeper analysis of the text. 

The kind and depth of the analysis depends on 
the reader. In the case of text extraction, the 
"reader" is the text understanding system, which 
implies a need for rather deep semantic analysis 
(Iwanska et al. 91; Soderland & Lehnert 94; Hahn
90). However, for (human) browsing and reading
in free expository texts, it suffices to delineate
the structure of the text and provide easy
access to the discovered substructures. This kind
of discourse segmentation is thus a critical task 
for exploratory reading of the retrieved text. 

This article presents an approach for discover- 
ing discourse structure in free expository texts. 
The identified structure can be used in various 
tasks such as text browsing and summarization. 
Section 1 surveys other approaches to discourse
segmentation, in particular those based on lexical
cohesion metrics. Section 2 details the proposed
method for identifying a hierarchical discourse
structure in the text, based on the hierarchical
agglomerative clustering method, while
using common lexical cohesion metrics. We show
how, through application of two simple rules, linear
discourse segmentation can be recognized in the
discovered structure. The output of the algorithm
is analyzed and evaluated in Section 3. Finally,
Section 4 concludes the paper and outlines future
work.

1 Discourse Segmentation Methods 

Two main approaches can be seen in discourse 
segmentation of free text: the multiple-source ap- 
proach where multiple kinds of evidence are used 
to determine discourse boundaries and relations, 
and the lexical cohesion approach, where lexical 
cohesion (or lexical similarity) is the sole criterion
for boundary detection. These two approaches 
are discussed in the following subsections. 

1.1 Multiple Source Methods 

With the dominant discourse analysis theories of 
today (Grosz & Sidner 86; Mann & Thompson 87;
Grosz et al. 95) there is no simple computational
way to determine the detailed discourse structure
from the free text. Detailed structure may include
participants' intentions, coherent discourse
segments, their functions and inter-relations. 

To reach this level of understanding, re- 
searchers are forced to use multiple sources of 
evidence and apply it in some adaptive manner. 
A good example is (Litman & Passoneau 94),
which uses prosodic features, cue phrases, and NP
references, and applies machine learning methods
for the analysis of verbal discourse. Another approach
(Kurohashi & Nagao 94; Miike et al. 94) is
to identify inter-sentential discourse relationships
using sources of evidence like cue phrases, topic
words/phrases, and grammatical and lexical similarity
between sentences. A complex set of decision rules
was developed to map cue word patterns to
potential discourse functions or relationships.

1.2 Lexical Cohesion Methods

When the source text is truly free text, methods
that attempt to identify detailed discourse structure
tend to be brittle. Researchers then turn to
simpler but reliable evidence, in particular, lexical
cohesion.

Lexical cohesion between two discourse segments
is an indicator of textual coherence, and is
achieved when the segments contain words which
are similar or semantically related (Halliday &
Hasan 76). We will say that a discourse boundary
exists between the two segments if the lexical
cohesion between them, as computed by some
similarity metric, for example the cosine
distance between the term vectors (Salton 89),
falls below some threshold.

Two related methods (Kozima; Hearst 94)
use a concept of a text window, within which they
compute a lexical cohesion function. By moving
the window W over the text, they form a linear
plot of the lexical cohesion as a function of
the word position, Wi (where the window is centered).
A discourse boundary is assigned to Wi if
that value falls below a threshold. Kozima uses
a semantic network with words at the nodes and
edges indicating their semantic relation as computed
from an MRD (machine-readable dictionary). The
lexical cohesion function computes the value at Wi
by spreading activation on the semantic network
for each word in the window W
and summing the output at Wi. Here, then, the
lexical cohesion takes into account the similarity
of the words based on their definition in a dictionary.
Hearst splits the window W into two halves,
the one to the left of Wi and the one to its right,
and determines the term vector of each. Term
vectors consist simply of the counts of each open
class word in the window. She then computes
the lexical cohesion function at Wi by evaluating
the similarity between the two term vectors using
the cosine distance formula.

While the above methods produce linear segmentation,
some attempts have been made to
identify a more elaborate structure. The lexical
chaining method (Morris & Hirst 91) attempts
to determine the hierarchical intention structure
(Grosz & Sidner 86) by identifying lexical chains
that run through the text. The lexical cohesion
between words in a chain is determined using various
relations defined over the Roget thesaurus;
however, the algorithm is only implemented manually.
A more practical approach is to build, from
the text, a graph with paragraphs at the nodes
and their lexical similarity at the edges (Salton &
Allan 94). By setting a threshold we can then
identify strongly-connected subgraphs which correspond
to inter-related paragraphs. This structure
can be used both to improve text retrieval
and for identifying themes for text browsing.
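
To make the window-based idea concrete, the following is a minimal sketch of a cohesion curve computed by comparing the two halves of a sliding window with the cosine measure, in the spirit of the description of Hearst's method above. It is not the TextTiling implementation: the function names, the `half` parameter (half-window size in tokens), the absence of open-class filtering, and the fixed threshold are all assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine between two sparse count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def window_cohesion(tokens, half):
    """Cohesion at each position i: the `half` tokens left of i
    compared with the `half` tokens starting at i."""
    scores = {}
    for i in range(half, len(tokens) - half + 1):
        left = Counter(tokens[i - half:i])
        right = Counter(tokens[i:i + half])
        scores[i] = cosine(left, right)
    return scores

def low_cohesion_positions(scores, threshold):
    """Positions whose cohesion falls below the threshold."""
    return [i for i, s in sorted(scores.items()) if s < threshold]

# Toy token stream with a topic shift after the fourth token
tokens = "moon orbit moon orbit tide water tide water".split()
scores = window_cohesion(tokens, half=2)
candidates = low_cohesion_positions(scores, threshold=0.25)
```

In this toy stream the only position whose two half-windows share no words is the one between the fourth and fifth tokens, so it is the single candidate boundary.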

2 Segmentation By Hierarchical 
Agglomerative Clustering 

The proposed segmentation process consists of 
three main phases: 

1. Morphological analysis. 

2. Hierarchical agglomerative clustering of text 
segments. 

3. Boundary detection. 

2.1 Morphological Analysis 

The purpose of this phase is to determine the 
terms to be used as content words in the following 
phase. The phase consists of the following steps: 

1. Tokenization. Convert the raw text, through a
regular expression recognizer, to a stream of
tokens: words, numbers and special symbols.

2. Perform part-of-speech tagging (Brill).
This step passes only open class words (adjectives,
verbs, adverbs, and nouns) to the next step.

3. Determine the general significance of each
word i, Gsig_i. In this experiment we use IDF
as the measure for general significance, using
frequency-in-files information from the BNC
corpus (Leech 92):

$$Gsig_i = IDF_i = \log \frac{N}{N_i} \qquad (1)$$

where N is the number of files in the corpus
and N_i is the number of files containing word
i.

4. Stemming. Replace each word by its stem
(Porter's algorithm is used here (Porter 80)).
The general significance Gsig_i, associated
with stem r_i, is the minimum Gsig over all words
j having this stem: $Gsig_i = \min_{j : r_j = r_i} Gsig_j$.
This has the effect of counting all instances
of a given stem as a single concept.
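
The last two steps can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the file counts are invented rather than taken from the BNC, the natural logarithm is assumed because the text does not give a base, and `toy_stem` is a deliberately crude stand-in for Porter's algorithm.

```python
import math

def idf(n_files, n_files_with_word):
    """General significance as in Equation 1: log(N / N_i)."""
    return math.log(n_files / n_files_with_word)

def stem_significance(doc_freq, n_files, stem):
    """Map each stem to the minimum Gsig over all words sharing that stem."""
    gsig = {}
    for word, n_i in doc_freq.items():
        s = stem(word)
        value = idf(n_files, n_i)
        gsig[s] = min(value, gsig.get(s, value))
    return gsig

def toy_stem(word):
    """Stand-in for Porter's algorithm (illustration only): drop a final 's'."""
    return word[:-1] if word.endswith("s") else word

# Hypothetical frequency-in-files counts; not BNC data
doc_freq = {"planet": 50, "planets": 20, "orbit": 10}
gsig = stem_significance(doc_freq, n_files=1000, stem=toy_stem)
```

Here "planet" and "planets" collapse to one stem, which receives the smaller of their two significance values.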



2.2 Hierarchical agglomerative clustering 
of text segments 

The main motivation behind the proposed algo- 
rithm is discovering a structure in text. The 
bottom-up Hierarchical Agglomerative Clustering
(HAC) algorithm is a widely used clustering
method in information retrieval (Everitt 80),
psychology (Milligan & Sokol 80), linguistics
(Kessler), and elsewhere.

When applying hierarchical agglomerative clus- 
tering on text segments the algorithm successively 
grows areas of coherence at the most appropriate 
place, thus forming a text structure. A similar 
approach (Maarek & Wecker 94) uses HAC to determine
a hierarchical bookshelf from a given set
of documents. 

The HAC algorithm for discourse segmentation,
based on paragraphs as the elementary segments,
is shown in Figure 1.

Partition the text into elementary segments
(= paragraphs).

While more than one segment is left do

    Apply a proximity test to find the two most
    similar consecutive segments, s_i, s_{i+1}.

    Merge s_i, s_{i+1} into one segment.

End while

Figure 1: Hierarchical agglomerative clustering of 
text segments 

Figure 2 shows the result of the algorithm for
the Stargazers text (Hearst 94), in a dendrogram
representation. Figure 3 shows the corresponding
outline representation, which plots the depth of
the nesting of the paragraphs in the dendrogram,
that is, the path length from the paragraph node
to the dendrogram root. The gray dashed vertical
lines in both plots indicate the segment boundaries.
Determination of these boundaries is discussed in 
the next section. 
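
The outline representation can be derived from a dendrogram with a short recursion. The sketch below assumes a dendrogram stored as nested pairs with paragraph indices at the leaves; that representation and the function name are illustrative choices, not something the paper specifies.

```python
def leaf_depths(tree, depth=0):
    """Return [(paragraph_index, depth)] in text order.

    The depth of a paragraph is its path length to the dendrogram root.
    """
    if not isinstance(tree, tuple):
        return [(tree, depth)]
    left, right = tree
    return leaf_depths(left, depth + 1) + leaf_depths(right, depth + 1)

# Hypothetical five-paragraph dendrogram
outline = leaf_depths((((0, 1), 2), (3, 4)))
```

Plotting the second component of each pair against the first gives the kind of outline described above, with deeply nested paragraphs appearing as higher points.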

The algorithm successively grows "coherent" 
segments by appending lexically related paragraphs,
or by merging larger segments. The result
is a hierarchical structure, called a dendrogram,
whose subtrees correspond to text segments.
We propose that the dendrogram represents the
internal hierarchy of the text discourse, similar to
an intention structure (Grosz & Sidner 86).



Using paragraphs as the elementary segments 
for the algorithm makes sense for a number of 
reasons. The paragraph is a universal linguistic
structure, representing a coherent textual segment
(Chafe 79; Longacre 79; Kieras 82). Allowing
a boundary in the middle of the paragraph
is thus counter to the author's intention.
In addition, a paragraph, unlike a sentence,
contains sufficient lexical information for
the proximity test.

Note that unlike general HAC applications, 
where at each stage we compute the proximity of 
the newly merged object to all other available segments,
in our case we compute only the proximity
of the segment to its two neighbors. This is because
we require that the linear order in the text
be preserved in the structure. The implication
for complexity is that while the general HAC
algorithm takes on the order of O(N^2) steps, ours takes
only O(N).

The proximity test selects the closest pair of
segments. The test is based on repetition of
words, a well-recognized indicator for lexical cohesion
(see (Hearst 94) for more references). The
test computes the cosine between the representative
term vectors of the segments (Salton & Buckley 88):

$$Proximity(s_i, s_{i+1}) = \frac{\sum_{k} w_{k,i} \cdot w_{k,i+1}}{\|s_i\| \, \|s_{i+1}\|} \qquad (2)$$

where $s_i$ is the term vector representing segment
i, $\|s_i\| = \sqrt{\sum_{k} w_{k,i}^2}$ is its length, the
sums run over all terms k, and $w_{k,i}$ is the
weight of word k in segment i:

$$w_{k,i} = f_{k,i} \cdot \frac{f_k}{f_{max}} \cdot Gsig_k \qquad (3)$$

The word weight $w_{k,i}$ is the product of three factors:
$f_{k,i}$, the frequency of the word in the segment,
serves as the in-segment factor; $f_k / f_{max}$, the
relative frequency of word k in the text, is an
in-text factor; and $Gsig_k$ is the general word
significance (Equation 1).
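
The three-factor word weight and the cosine proximity can be sketched as below. The sketch assumes that f_max is the highest frequency of any term in the whole text, which the garbled source does not state explicitly, and it uses invented significance values. The function names are hypothetical.

```python
import math
from collections import Counter

def segment_weights(seg_tokens, text_freq, gsig):
    """Weight per term: in-segment count * (text count / f_max) * Gsig.

    Assumption: f_max is the highest count of any term in the whole text.
    """
    f_max = max(text_freq.values())
    counts = Counter(seg_tokens)
    return {k: f * (text_freq[k] / f_max) * gsig.get(k, 0.0)
            for k, f in counts.items()}

def proximity(w_a, w_b):
    """Cosine between two sparse weight vectors, as in Equation 2."""
    dot = sum(w * w_b[k] for k, w in w_a.items() if k in w_b)
    len_a = math.sqrt(sum(w * w for w in w_a.values()))
    len_b = math.sqrt(sum(w * w for w in w_b.values()))
    return dot / (len_a * len_b) if len_a and len_b else 0.0

# Two toy segments of stemmed content words
seg1 = ["moon", "moon", "earth"]
seg2 = ["moon", "tide"]
text_freq = Counter(seg1 + seg2)
gsig = {"moon": 1.0, "earth": 2.0, "tide": 2.0}  # hypothetical values

w1 = segment_weights(seg1, text_freq, gsig)
w2 = segment_weights(seg2, text_freq, gsig)
score = proximity(w1, w2)
```

Only "moon" is shared by the two segments, so it alone contributes to the numerator, while all terms contribute to the vector lengths.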

2.3 Boundary Detection 

The algorithm for boundary detection in the den- 
drogram makes use of size and depth attributes of 
a segment. As indicated above, a segment corre- 
sponds to a subtree in the resulting dendrogram 
tree. The segment size is defined as the number 
of leaves, i.e. paragraphs, it contains. Its depth 
is defined as the longest path in the subtree from
the root to the leaves. Thus, a size 1 segment is
a single paragraph.

Figure 2: Paragraph dendrogram of the Stargazers article. The leaves in the dendrogram are
paragraphs shown as a sequence of equal-length vertical lines (the paragraph's sentences). The scale below
the X axis shows sentence numbers and the one above paragraph numbers (placed at the end of the
respective paragraphs). Gray dashed vertical lines show the computed boundaries.

Figure 3: Outline of the Stargazers article. The graph plots the depth of each paragraph in the
dendrogram, i.e., its path length to the dendrogram root. The notches indicate the depth of the merge
points. Gray dashed vertical lines are segment boundaries. Paragraph marks are shown above the X
scale, sentence marks below it.

With these definitions, the algorithm for
boundary detection is stated in Figure 4.

For each node in the dendrogram tree T do

    Let S1 and S2 be the two segments being
    merged at the node, such that
    size(S1) > size(S2).

    Set a boundary between the two segments
    if one of the following two rules holds:

        The notch rule:
            size(S1) > n ∧ size(S2) > n

        The cliff rule:
            size(S1) > n ∧ size(S2) < n ∧
            depth(S1) − depth(S2) > m

End for each

Figure 4: Algorithm for identifying boundaries in 
a dendrogram 
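
The two rules can be sketched over a nested-pair dendrogram as follows. Several points are assumptions rather than the paper's specification: the nested-pair tree with paragraph indices at the leaves, reporting a boundary as the index of the last paragraph before it, and reading the small segment in the cliff rule as size(S2) <= n, since with n = 1 a strict inequality could never hold.

```python
def size(tree):
    """Number of leaves (paragraphs) in the subtree."""
    return 1 if not isinstance(tree, tuple) else size(tree[0]) + size(tree[1])

def depth(tree):
    """Longest path from the subtree root to a leaf."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[0]), depth(tree[1]))

def last_leaf(tree):
    """Index of the rightmost paragraph in the subtree."""
    while isinstance(tree, tuple):
        tree = tree[1]
    return tree

def find_boundaries(tree, n=1, m=None):
    """Return the paragraph indices after which a boundary is set."""
    if m is None:
        m = depth(tree) / 5  # the minimum the text reports as chosen experimentally
    found = []

    def visit(node):
        if not isinstance(node, tuple):
            return
        left, right = node
        # s1 is the larger of the two merged segments
        s1, s2 = sorted((left, right), key=size, reverse=True)
        notch = size(s1) > n and size(s2) > n
        # Assumption: the small segment is read as size <= n
        cliff = size(s1) > n and size(s2) <= n and depth(s1) - depth(s2) > m
        if notch or cliff:
            found.append(last_leaf(left))
        visit(left)
        visit(right)

    visit(tree)
    return sorted(found)

notch_case = find_boundaries(((0, 1), (2, 3)))
cliff_case = find_boundaries(((((0, 1), 2), 3), 4), n=1, m=2)
```

In the first tree two two-paragraph segments meet at the root, which triggers the notch rule. In the second a lone final paragraph is attached to a much deeper segment, which triggers the cliff rule at the root only.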

The algorithm defines two rules to identify 
boundaries. The notch rule constrains segments 
across boundaries to be of a significant size. We 
found that n = 1 gives maximum boundary in- 
formation without adding false boundaries, that 
is, it allows a paragraph that is cohesive with its 
neighbor to be merged with it without creating 
a boundary between them. The cliff rule relaxes 
the notch rule, allowing one of the segments to 
be smaller than n if the difference between their 
depths is larger than a threshold m. Such bound- 
aries indicate remotely related segments and are 
seen as high cliffs in the outline plot. The minimum
for m was set experimentally to depth(T)/5.

Cases of the notch rule are seen, as the name 
implies, as deep notches (deeper than 1 (= n), in 
our case) in the outline view. See, for example,
Figure 3, between paragraphs 11 and 12. Cases
of the cliff rule may indicate a setting (or intro- 
duction) segment at the beginning of a larger text 
segment, or a summary segment at its end. These 
setting and summary segments consist of para- 
graphs, each discussing a different topic. This 
creates a build-up effect (in case of setting) or a 
fall-off effect (in case of summary). Cliff bound- 
aries happen less frequently than notches. For 
example, in Figure 3 they appear between paragraphs
3 and 4, and between paragraphs 18 and
19. In the last case, paragraphs 19, 20 and 21 can
be regarded as a conclusion section for the whole
article. While the bulk of the article talks about 
the special case of the earth and the moon, and 
their life-enabling conditions, the last paragraphs 
summarize the conditions for life existence in a 
solar system and future research directions to be 
undertaken by astronomers. 

3 Evaluation 

We have used the Stargazers article, discussed in 
the previous section, as the test bench for evaluation.
Stargazers is an expository text that discusses
the conditions for the evolution of life in solar
systems. What makes it particularly useful is that
segmentation data is available both as produced
by Hearst's TextTiling algorithm, which is robust
and gives good results, and as produced by human
judges (Hearst 94).

Comparing the results of TextTiling and Hierarchical
Agglomerative Clustering for boundary
detection shows an impressive match. Table 1
compares the results of the two algorithms against
those of the human judges. The boundaries for
the human judges are those with agreement of 3
or more among the 7 judges, and are considered
the correct boundary set. The P and R columns
give the precision and recall relative to that correct
set. These results, while not yet very
extensive, are encouraging. The reason for the good
match between boundaries determined by the two
algorithms is that in both cases boundaries are
set when they separate segments of low lexical
cohesion. The main difference is the way these
segments are determined: fixed size in case of
TextTiling, versus variable size in case of HAC.

                Boundaries                 P     R
Human judges    2 3 5 8 9 12 13 16 18     100   100
TextTiling      3 5 9 11 13 16 18 20       69    56
HAC             2 3 5 9 11 13 16 18        87    78

Table 1: Performance of discourse segmentation
algorithms
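
As a worked check of how the P and R columns are defined, the sketch below computes precision and recall for the HAC row of Table 1 against the human-judge boundaries. The function name is hypothetical, and only the HAC row is reproduced here.

```python
def precision_recall(found, correct):
    """Precision and recall of a proposed boundary set against the correct set."""
    found, correct = set(found), set(correct)
    hits = len(found & correct)
    return hits / len(found), hits / len(correct)

# Boundary sets as listed in Table 1
judges = [2, 3, 5, 8, 9, 12, 13, 16, 18]
hac = [2, 3, 5, 9, 11, 13, 16, 18]

p, r = precision_recall(hac, judges)
```

Seven of the eight HAC boundaries are in the correct set and seven of the nine correct boundaries are found, giving 7/8 = 0.875 and 7/9 ≈ 0.778, in line with the 87 and 78 reported for HAC.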

But the HAC algorithm provides richer information
than just linear segmentation. The hierarchical
clustering created by the algorithm identifies
a nested outline of the text. For example,
we can deduce from the outline of Figure 3
that the {17...18} segment is part of a larger
{14...18} segment. Indeed, the enclosing segment
is about binary/trinary systems while the
subsegment {17...18} is about their low probability.
Similarly we can deduce that the segment
{12,13} is more lexically related to {10,11} than
to {14,15,16}. Another example is the typical
build-up of a setting and fall-off of a summary,
seen in coherent texts (see Figs. 3 and 5). This
information may help us later in constructing a
table-of-contents visual representation of the text.

Figure 6 presents an outline of a case of "non-coherent"
text. The article is not about any specific
subject but rather a survey of special events
in genetic engineering during 1995. The outline 
shows deep notches, following paragraphs 13, 22, 
31, 35, and 49. These are the exact boundaries 
between the main articles in the text. Unlike the 
former examples, here there is no fall-off summary 
at the end. This is expected since the ending 
paragraphs are, in fact, a series of three tiny inde- 
pendent articles, following paragraphs 55, 57 and 
60. 

4 Conclusions and Future Work 

The main topic for research in the HAC algo- 
rithm is the proximity test. At the moment 
it is a rather simple lexical similarity test, so 
some modifications are possible in the way words 
are weighted (see Equations 1 and 3). A more
radical approach is using concept vectors as in
WordSpace (Schutze). Other sources of information
can be used to complement lexical similarity.
In particular, evidence involving cue phrases
and part-of-speech patterns can be processed, using
previously-trained decision trees, to augment
the lexical similarity function (Litman & Passoneau
94).

Another research direction is table-of-contents
production. The clustering produced by the HAC 
algorithm provides the necessary structure infor- 
mation. The main task here, and a major research 
topic, is identification of topics, or titles, for the 
segments. 

Finally, while the comparison with the TextTil- 
ing algorithm and the human judges is promising, 
a methodical evaluation of additional texts is re- 
quired. 

Figure 5: Outline of "How to Make a Desert", Discover Magazine, 2/96

Figure 6: Outline of "Special issue: The Year in Science - Genetics", Discover Magazine,